ABSTRACT

Motivation: Large multiple genome alignments and inferred ancestral
genomes are ideal resources for comparative studies of molecular
evolution, and advances in sequencing and computing technology
are making them increasingly obtainable. These structures can provide
a rich understanding of the genetic relationships between all subsets
of species they contain. However, current formats for storing genomic
alignments, such as XMFA and MAF, are all indexed or ordered
using a single reference genome, which limits the information
that can be queried with respect to other species and clades. This
loss of information grows with the number of species under comparison,
as well as their phylogenetic distance.

Results: We present HAL, a compressed, graph-based hierarchical
alignment format for storing multiple genome alignments and ancestral
reconstructions. HAL graphs are indexed on all genomes they contain.
Furthermore, they are organized phylogenetically, which allows for
modular and parallel access to arbitrary subclades without fragmentation
caused by rearrangements that have occurred in other lineages.
HAL graphs can be created or read with a comprehensive
C++ API. A set of tools is also provided to perform basic operations,
such as importing and exporting data, identifying mutations and
coordinate mapping (liftover).

Availability: All documentation and source code for the HAL API and 
tools are freely available at http://github.com/glennhickey/hal. 
Contact: hickey@soe.ucsc.edu or haussler@soe.ucsc.edu 
Supplementary information: Supplementary data are available at 
Bioinformatics online. 
1 INTRODUCTION

A DNA (or protein) sequence alignment groups together positions
in the sequences it contains that are homologous (related
by descent). Positions within the same sequence can be homologous
via duplication events. The multiple alignment problem is
NP-hard, but tools have been developed to produce large (tens of
thousands of sequences), accurate alignments (Notredame, 2007),
provided the input sequences are relatively short and conserved,
such as gene exons. Whole-genome alignment is much more difficult,
not only because of the increased length of the input but
also because of the presence of large spans of non-conserved
sequence (Blanchette et al., 2004; Paten et al., 2011). Changes
caused by large-scale rearrangement events, such as inversions,
segmental duplications and transpositions, must also be taken
into account in addition to point mutations and small insertions
and deletions (indels).

These challenges of creating whole-genome alignments carry
over to their representation and analysis. Gapped matrices that
are traditionally used for gene alignments become fragmented
into blocks by rearrangements or excessive divergence. These
blocks are stored by current formats as ASCII lines containing
coordinates and DNA strings for each sequence in the alignment.
As blocks can only be ordered with respect to a single reference
row, performing disk-based queries using non-reference coordinates
is extremely inefficient because of fragmentation, even if
external indexes were to be constructed for them. The hierarchical
alignment (HAL) graph structure and tool set described later
in the text were designed to address this issue, while adding
support for file compression.

2 METHODS 

2.1 HAL format 

HAL's design was guided by two observations: (i) breakpoint graphs are
the most natural way of representing genome rearrangements (Pevzner
and Tesler, 2003) and (ii) progressive alignment (based on a phylogenetic
decomposition) has been the most successful heuristic for multiple
sequence alignment (Notredame, 2007) and is likely to remain so for
whole-genome alignment. A HAL graph, therefore, decomposes a multiple
alignment into a set of pairwise alignments, which are represented as
breakpoint graphs. Each pairwise alignment corresponds to a branch of a
rooted phylogenetic tree. In the absence of a tree, a reference genome can
be used as a root with all other genomes as leaves.
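
The decomposition into one pairwise alignment per branch can be
illustrated with a minimal sketch. It is not part of the HAL API: the
function names, the dictionary-based tree and the genome names are all
assumptions made for illustration. Each (ancestor, descendant) pair it
returns stands for one pairwise alignment, and the fallback builds the
star-shaped tree rooted at a reference genome.

```python
from typing import Dict, List, Tuple

# Hypothetical tree encoding: each genome name maps to its list of children.
Tree = Dict[str, List[str]]


def branches(tree: Tree, root: str) -> List[Tuple[str, str]]:
    """Return one (ancestor, descendant) pair per branch below root."""
    pairs: List[Tuple[str, str]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            pairs.append((node, child))  # one pairwise alignment per branch
            stack.append(child)
    return pairs


def star_tree(reference: str, others: List[str]) -> Tree:
    """Fallback when no phylogeny is known: reference as root, rest as leaves."""
    return {reference: list(others)}


# Illustrative genome names only.
example_tree: Tree = {"Anc0": ["Anc1", "mouse"], "Anc1": ["human", "chimp"]}
example_branches = branches(example_tree, "Anc0")
```

A tree with n genomes yields n - 1 branches, so the five-genome example
above decomposes into four pairwise alignments, each of which could be
processed independently of the others.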

A genome in HAL is represented by up to three arrays (Fig. 1A):
a sequence array, a top segment array if the genome has an ancestor in
the tree and a bottom segment array if the genome has one or more
descendants in the tree. For each branch in the phylogenetic tree, edges
in the HAL graph connect bottom segments from the ancestral genome
to top segments in the descendant genome (Fig. 1B). These edges define
the pairwise alignments between the ancestral genome and each
of its descendants. The amount of segmentation along a branch is, therefore,
determined by the number of unique breakpoints in these pairwise
alignments, including those induced by indels. Each top (resp.
bottom) segment S is assigned a parse edge connecting it to the bottom
(resp. top) segment of the same genome that overlaps the first base
of S on the DNA sequence. Paralogous (duplicated) regions of the genome
are represented by sets of top segments that share an ancestor. Inversions
are represented as flags on edges between segments. Chromosomes or
scaffolds are represented as contiguous subranges of the genome.
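
The three arrays and the parse edges can be sketched as follows. This is
an illustration under stated assumptions, not the HAL implementation: the
class and function names are hypothetical, segments are assumed to tile
the sequence contiguously in sorted order, and binary search over segment
start positions is simply one straightforward way to find the segment
that overlaps a given base.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    start: int                          # index of the first base in the sequence array
    length: int
    parse_index: Optional[int] = None   # parse edge: index into the other segment array


@dataclass
class Genome:
    sequence: str            # DNA characters
    top: List[Segment]       # segmentation w.r.t. the parent (empty for the root)
    bottom: List[Segment]    # segmentation w.r.t. the children (empty for leaves)


def find_segment(starts: List[int], position: int) -> int:
    """Index of the segment containing position, by binary search.

    starts must be the sorted start positions of contiguous segments,
    so the lookup is logarithmic in the number of segments.
    """
    i = bisect_right(starts, position) - 1
    if i < 0:
        raise IndexError("position precedes the first segment")
    return i


def assign_parse_edges(genome: Genome) -> None:
    """Link each segment to the segment of the other array overlapping its first base."""
    if not genome.top or not genome.bottom:
        return  # root or leaf: only one segment array, so no parse edges
    top_starts = [s.start for s in genome.top]
    bottom_starts = [s.start for s in genome.bottom]
    for seg in genome.top:
        seg.parse_index = find_segment(bottom_starts, seg.start)
    for seg in genome.bottom:
        seg.parse_index = find_segment(top_starts, seg.start)


# Illustrative internal-node genome of 10 bases with two segmentations.
example = Genome(
    sequence="ACGTACGTAC",
    top=[Segment(0, 4), Segment(4, 6)],
    bottom=[Segment(0, 3), Segment(3, 4), Segment(7, 3)],
)
assign_parse_edges(example)
```

In this example the second bottom segment starts at base 3, which lies
inside the first top segment, so its parse edge points to top segment 0.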

Provided the graph is in memory, all segments and DNA bases can be
looked up by their array index, and all edges can be traversed, in O(1)
time. Locating the segment that contains a particular DNA position
within a given chromosome requires O(log m) time, where m is the
number of segments. Heuristics are used to further speed this operation
up in practice. An arbitrary amount of metadata can be added to each
genome. Missing data are stored as segments of 'N' characters with no
edges. Supplementary Sections S1 and S3 describe how HAL offers similar
compression to gzip and vastly reduced query times when compared
with Multiple Alignment Format (MAF).

Fig. 1. (A) A single genome as represented in HAL. Two sequences are
stored in an array of DNA characters and are segmented with respect to
its parent (top segments) and children (bottom segments). (B) The same
genome in the context of a HAL graph of five genomes. The dashed edge
corresponds to an inversion event.

2.2 HAL API

A comprehensive C++ API is provided to create and query HAL files,
which are presently stored in Hierarchical Data Format (HDF5). HDF5
is a longtime standard, supported on all major platforms, for storing large
matrix data and is optimized for efficient indexing, caching and compression
(The HDF5 Group, 2000-2010). Data within HDF5 can be quickly
and randomly indexed, and decompression and caching are abstracted
from the user. This allows for efficient external memory algorithm
design, a requirement for multi-genome analysis. We note that different
back ends could be added in the future.

HAL graphs are accessed by segment iterators, which traverse the
graph through its native structure of segments and edges, and column
iterators, which dynamically transpose the graph (or desired subgraph)
to a traditional matrix block/format to iterate across alignment columns.

2.3 HAL tools

A set of command line utilities, summarized in Table 1, is provided to
create and analyze multiple genome alignments in HAL. Importers are
provided for UCSC's MAF, which is a standard with its own rich set of
filters and converters (e.g. to FASTA) (Blanchette et al., 2004), and Cactus
(Paten et al., 2011), which has been designed specifically to output HAL.
MAF files can be quickly produced from HAL graphs for given subgraphs
with respect to arbitrary references to be compatible with existing
browsers and tools. The memory usage of each tool is configurable via its
command line options.

Table 1. HAL tools summary

Tool                     Description
halStats                 Print summary statistics of HAL file
halSummarizeMutations    Print mutation summary for given subgraph
halBranchMutations       Generate BED file(s) of mutations for a branch
halLiftover              Map BED coordinates between genomes
hal2maf/maf2hal          Convert to and from MAF
cactus2hal               Convert from Cactus

Mutations can be identified along branches and output to tab
delimited annotation files using the halBranchMutations tool. A cycle
decomposition of the breakpoint graph structure allows rearrangements,
such as duplications, inversions and transpositions, to be reported in
addition to substitutions, insertions and deletions. Small indels (determined
by a provided threshold) can be nested within larger rearrangements to
avoid overcounting in these cases. Patterns of conservation within a
target sequence can be aggregated using the halLiftover tool, which
maps coordinates in a BED file to an arbitrary target in the alignment.
This utility provides a general strategy to efficiently lift over and project
any comparative genomics information into the coordinate system of any
reference genome. Excellent software packages are available for sorting,
combining and querying BED files (Quinlan and Hall, 2010; Neph et al.,
2012) and can be combined with the aforementioned tools to create
powerful analysis pipelines for multiple genome alignments.


3 CONCLUSION 

We have presented HAL, a data format, API and set of tools for
storing and analyzing genome alignments and ancestral reconstructions.
The key features of HAL are its indexing, which
allows fast coordinate mapping between arbitrary subsets of genomes,
and its graph structure, which facilitates analysis of genome
rearrangements, as well as modular decomposition into clades.
The compression and chunking capabilities of HDF5 are leveraged
to keep I/O and memory usage to a minimum. All of these
properties, in particular the ability to parallelize by clade, will be
necessary for alignments that arise from current large-scale
sequencing projects, such as Genome 10K (Haussler et al.,
2009) (more details in Supplementary Section S2).